BBC News Classification

There are 3 datasets provided for this project: Sample Solution, Testing, and Training data. I will first import all the data.
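In sketch form, the three files would each be loaded with pd.read_csv. The exact filenames are not shown in this excerpt, so small toy frames mirroring the described columns stand in for the real files below:

```python
import pandas as pd

# In the notebook these would be pd.read_csv(...) calls on the three provided
# files; toy frames with the same column layout stand in here.
train = pd.DataFrame({
    "ArticleId": [1, 2, 3],
    "Text": ["markets rally on earnings", "striker scores twice", "new gadget unveiled"],
    "Category": ["business", "sport", "tech"],
})
test = pd.DataFrame({
    "ArticleId": [4, 5],
    "Text": ["shares fall on profit fears", "team lifts the trophy"],
})
solution = pd.DataFrame({"ArticleId": [4, 5], "Category": ["business", "sport"]})

print(train.shape, test.shape, solution.shape)
```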

The solution dataset shows what a submission should look like.

Step 1: Extracting word features and show Exploratory Data Analysis (EDA) — Inspect, Visualize and Clean the Data

In this section, I will share visualizations and describe data cleaning procedures.

The solution dataset has 735 non-null rows and 2 columns: ArticleId, which identifies an article, and Category, the category of that article.

The test dataset also has 735 non-null rows and 2 columns: ArticleId and Text, which holds the actual text of the article.

The training dataset has 1490 non-null rows and 3 columns: ArticleId, Text, and Category. Compared with the test data, the training data has the extra Category column, which provides the label for each article.

Check for empty string values in columns:
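A quick way to run this check is an elementwise comparison against the empty string, summed per column (toy dataframe below for illustration; nulls are checked separately with isna()):

```python
import pandas as pd

df = pd.DataFrame({
    "Text": ["hello", "", "world"],
    "Category": ["a", "b", ""],
})

# Count empty-string cells in each column; .isna().sum() covers true nulls.
empty_counts = (df == "").sum()
null_counts = df.isna().sum()
print(empty_counts)
print(null_counts)
```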

None of the tables contain null values or empty strings.

Visualizations:

I will first view the distribution of how many articles are in each category.
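A bar chart of value_counts is one way to produce this plot. The sketch below uses toy labels in place of the training dataframe:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt

train = pd.DataFrame({"Category": ["sport", "sport", "business", "tech"]})

# Count articles per category and draw a bar chart.
counts = train["Category"].value_counts()
ax = counts.plot(kind="bar", title="Articles per category")
ax.set_xlabel("Category")
ax.set_ylabel("Number of articles")
plt.tight_layout()
print(counts)
```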

From the plot above, it can be seen that sport and business have the most articles, while tech has the fewest.

Next, I want to visualize the length of the text column and understand what the distribution is.
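One way to do this is to add a character-length column and summarize or histogram it (toy texts below stand in for the real articles):

```python
import pandas as pd

train = pd.DataFrame({
    "Text": ["short text", "a somewhat longer piece of text here", "mid length text"],
})

# Character length of each article; a histogram of this column shows the
# distribution (train["text_len"].hist(bins=50) in the notebook).
train["text_len"] = train["Text"].str.len()
print(train["text_len"].describe())
```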

From the plot above, it can be seen that the text lengths follow a roughly normal distribution, but the distribution is skewed.

Data Cleaning & Text Processing

Next, I will clean the text data and then process it using TF-IDF.

In the example above, there is a lot of punctuation, so I will remove it using a regex.
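A minimal sketch of such a regex cleaner, keeping word characters and whitespace and dropping everything else:

```python
import re

def remove_punctuation(text: str) -> str:
    # \w matches letters, digits and underscore; \s matches whitespace.
    # Everything outside those classes (punctuation, symbols) is dropped.
    return re.sub(r"[^\w\s]", "", text)

print(remove_punctuation("UK's economy, it's said, grew 0.5%!"))
```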

As can be seen above, with punctuation removed the text will be easier to model.

Next, I will create a function to remove stopwords:
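As a sketch, such a function can filter tokens against a stop-word list. scikit-learn's built-in English list is used here as a stand-in; the notebook may use a different list (e.g. NLTK's):

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def remove_stopwords(text: str) -> str:
    # Keep only tokens that are not in the stop-word list (case-insensitive).
    return " ".join(w for w in text.split() if w.lower() not in ENGLISH_STOP_WORDS)

print(remove_stopwords("the match was won by the home side"))
```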

I will be using TF-IDF to process the text. It is widely used for text mining and classification. It is a method for creating document-term matrices, similar to bag-of-words, but it stores a measure of each word's relevance to each document by reweighting the raw counts. TF-IDF can be broken down as follows:

  1. Term Frequency (TF) - the number of times a given word appears in a document.
  2. Inverse Document Frequency (IDF) - the inverse of the number of documents a given word appears in.

Step 2: Building and training models

1. Think about this and answer: when you train the unsupervised model for matrix factorization, should you include texts (word features) from the test dataset or not as the input matrix? Why or why not?

No, it is not advised to include the test dataset in any capacity. Doing so results in data leakage: if the model is exposed to test data during training, the evaluation on that test set no longer reflects performance on genuinely unseen data. It is also bad practice because the model should be fit on the training data only, with the test data reserved purely for evaluation, which is the only fair way to assess generalization.

2. Build a model using the matrix factorization method(s) and predict the train and test data labels.

I will be using Non-Negative Matrix Factorization.
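A minimal sketch of NMF on TF-IDF features follows. The toy documents, n_components=2, and the argmax topic assignment are illustrative assumptions; the real notebook would use one component per BBC category and map topics to category names (commonly by majority vote on the training labels):

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "stocks rise as markets rally",
    "shares fall on profit fears",
    "striker scores winning goal",
    "team wins league title",
]
X = TfidfVectorizer().fit_transform(docs)

# Factorize X ≈ W·H, where W gives per-document topic weights and
# H gives per-topic word weights.
nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = nmf.fit_transform(X)

# Assign each document to its strongest latent topic.
topic_of_doc = W.argmax(axis=1)
print(topic_of_doc)
```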

3. Measure the performance of predictions on both the train and test datasets. You can use accuracy, confusion matrix, etc., to inspect the performance. You can get accuracy for the test data by submitting the result to Kaggle.

4. Change hyperparameter(s) and record the results. We recommend including a summary table and/or graphs.

5. Improve the model performance if you can. Some ideas include, but are not limited to: using different feature extraction methods, fitting models on different subsets of the data, and ensembling the model prediction results.
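As a sketch of the measurement step, accuracy and a confusion matrix can be computed with scikit-learn's metrics (the labels and predictions below are invented for illustration):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = ["sport", "business", "sport", "tech"]
y_pred = ["sport", "business", "tech", "tech"]

# Overall fraction of correct predictions.
acc = accuracy_score(y_true, y_pred)

# Rows are true categories, columns are predicted categories.
cm = confusion_matrix(y_true, y_pred, labels=["business", "sport", "tech"])
print(acc)
print(cm)
```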

Based on this plot, C=1000 gives the best-fitting model.

Using the best parameters from the table above, we can build a model:
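The estimator class is not shown in this excerpt; since C is the inverse-regularization parameter of models such as LogisticRegression, the sketch below assumes that class (an assumption, not necessarily the notebook's exact setup), with C=1000 taken from the tuning above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["stocks rally", "shares slump", "great goal", "cup final win"]
labels = ["business", "business", "sport", "sport"]

# Pipeline: TF-IDF features into a classifier with the tuned C value.
# LogisticRegression is an assumed choice here; C=1000 means weak regularization.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(C=1000, max_iter=1000))
model.fit(docs, labels)
print(model.predict(["goal in the final"]))
```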

With hyperparameter optimization, the accuracy score has increased to 1.0. This is not necessarily a good sign, as it suggests the model may be overfitting.

Test accuracy has stayed the same.

Step 3: Compare with supervised learning

1) Pick and train a supervised learning method(s) and compare the results (train and test performance)

For my supervised learning model, I am choosing KNeighborsClassifier, as it is well suited to classification and simple to implement.
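A minimal sketch of KNeighborsClassifier on TF-IDF features (toy documents and n_neighbors=1 are illustrative; the notebook's actual neighbor count is not shown in this excerpt):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

docs = [
    "markets rally on earnings",
    "profits fall sharply",
    "striker scores twice",
    "team lifts the trophy",
]
labels = ["business", "business", "sport", "sport"]

# Classify a document by the label of its nearest training document
# in TF-IDF space (n_neighbors=1 for this tiny example).
knn = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))
knn.fit(docs, labels)
print(knn.predict(["striker scores the winner"]))
```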

2) Discuss comparison with the unsupervised approach. You may try changing the train data size (e.g., Include only 10%, 20%, 50% of labels, and observe train/test performance changes). Which methods are data-efficient (require a smaller amount of data to achieve similar results)? What about overfitting?

For this question, I only decided to use 50% of the data. It can be seen that the test accuracy is better while the train accuracy shows no sign of overfitting, so I think supervised learning is the better approach here. In terms of data efficiency, I also think the supervised model is faster, because the unsupervised approach has to discover structure in the text on its own.

References: